Two phase feature-ranking for new soil dataset for Coxiella burnetii persistence and classification using machine learning models

Coxiella burnetii (Cb) is a hardy, stealthy bacterial pathogen lethal to humans and animals. Its remarkable environmental resistance, ease of propagation, and extremely low infectious dose make it an attractive organism for biowarfare. Current research on the classification of Coxiella and on the features influencing its presence in soil is largely confined to statistical techniques. Machine learning, as an alternative to these traditional approaches, can improve epidemiological modeling for this soil-based pathogen of public health significance. We developed a two-phase feature-ranking technique for the pathogen on a new soil feature dataset. The first phase applies ReliefF (RLF), OneR (ONR), and correlation (CR); the second phase combines their outputs using weighted scores to determine the final soil attribute ranks. Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Logistic Regression (LR), and Multi-Layer Perceptron (MLP) classifiers were used to classify the soil attribute dataset into Coxiella positive and negative soils. The feature-ranking methods established that potassium, chromium, cadmium, nitrogen, organic matter, and soluble salts are the most significant attributes, whereas manganese, clay, phosphorus, copper, and lead contribute least to the prevalence of the bacteria. Potassium is the most influential feature and manganese the least significant. Attribute ranking using RLF gives the most promising results among the ranking methods, with an accuracy of 80.85% for MLP, 79.79% for LR, and 79.8% for LDA. Overall, SVM and MLP are the best-performing classifiers: SVM yields accuracies of 82.98% and 81.91% for attribute ranking by CR and RLF, respectively, and MLP yields 76.60% for ONR.
Thus, machine learning models can help us better understand the environmental factors that favor the prevalence of the bacteria and reduce the chance of misclassification. This, in turn, can assist in controlling epidemics and alleviating their devastating socio-economic effects.

www.nature.com/scientificreports/
silt, macro and micro-nutrients like carbon, phosphorus, sodium, potassium, sulfate, calcium, and magnesium play an essential part in the prevalence of various pathogens like C. burnetii, F. tularensis, Burkholderia mallei, etc. Results also suggest that the soil is a reservoir for the prevalence and further dispersion of pathogens in the environment. Generally, soil pH is essential in shaping bacterial communities in soils. Previous studies demonstrate that low pH is vital for the metabolic activity of C. burnetii 27 . Some works 8 suggest that factors such as soil moisture and vegetation are relevant to the prevalence of C. burnetii. It is further reported 28 that hot and dry conditions mainly help windborne dispersion of C. burnetii aerosols. However, little work has been presented on classifying pathogens in soil-related environments using machine-learning techniques, apart from our previous works. In our initial work 29 , we applied artificial neural networks to classify F. tularensis (Ft) using a soil attribute dataset. The method attained an accuracy of 82.61% with one hidden layer of 10 neurons. That soil attribute dataset contains 147 instances from Ft negative and positive sites; each instance contains 21 features and a class attribute. In our next work 30 , we further improved the accuracy to 84.35%. We applied feature ranking to identify the features most related to the survival of the pathogen (e.g., clay, nitrogen, zinc, nickel, organic matter, soluble salts, silt) and those least related (e.g., potassium, phosphorus, iron, calcium, copper, chromium, sand).
Table 1 gives an overview of various statistical and machine-learning approaches applied to assess the role of environmental features in the prevalence of different pathogens. The present work focuses on feature ranking and classification using various machine-learning techniques. Automatically classifying C. burnetii with machine learning models, and identifying the most relevant features that help it prolong environmental survival, can yield more reliable, accurate, and standardized results. Our contributions can be summarized as follows:
1. We present a novel soil attribute dataset for Coxiella positive and negative sites containing 21 soil features.
2. To the best of our knowledge, this is the first work to apply machine learning models, rather than contemporary statistical models, to understanding the behavior of C. burnetii in the environment.
3. Our model performs a two-phase feature ranking. Initially, attributes are ranked by individual feature-ranking methods; then a combination of techniques is applied to calculate weighted scores that determine the final soil attribute ranks.
4. The model also compares the performance of feature-ranking algorithms and machine learning classifiers.
5. Our model performs classification and identifies the most relevant features that help prolong the pathogen's survival in the environment, with a classification accuracy of up to 82.98%.
6. We apply 10-fold cross-validation to establish the performance of the proposed method.

Material and methods
The research concentrates on a comparative study of different state-of-the-art machine learning techniques employed in various fields for classification and feature ranking in a unique soil attribute dataset for Coxiella +Ve and -Ve sites. Further, we compare the performance of state-of-the-art feature ranking models and classifiers. Lastly, we propose a machine-learning model for the classification of Coxiella using soil attribute data, as exhibited in Fig. 1.

Coxiella soil attribute dataset acquisition.
Soil samples of approximately 500-800 g, weighed using a portable electronic balance, were taken from C. burnetii positive (n=47) and negative (n=47) sites. The dataset contains 21 chemical and physical soil features, such as maximum soluble salt, organic matter, silt, clay, and micro- and macro-nutrients. The values of these physical and chemical soil features are shown in Table 2. The dataset is the property of the Institute of Microbiology, Veterinary and Animal Sciences University, Lahore, Pakistan 22 .
Feature selection. Data filtering is essential for assembling an efficient and accurate model, because it allows us to extract the best set of attributes. Here, 21 input features are extracted from the soil feature dataset. In this article, $X_{dn} = [X_{1n}, X_{2n}, \ldots, X_{Dn}]$ represents the feature matrix with $D$ column vectors, and $x_{dn}$ is a single feature value (with $d = 1, 2, 3, \ldots, D$ and $n = 1, 2, 3, \ldots, N$; $D = 21$ and $N = 94$ in the dataset).
Attribute selection models. An attribute selection model combines a search function that proposes new attribute subsets with an assessment criterion that scores them 35 . The ideal algorithm would test every possible subset of attributes and find the one that minimizes the error rate. However, this exhaustive search becomes computationally intractable for larger feature spaces. The choice of evaluation metric also significantly affects the outcome of the search. Three feature selection algorithms have been used here: ReliefF (RLF), correlation (CR), and OneR (ONR). As explained below, each algorithm has its own characteristics:
ReliefF. The algorithm assigns a weight to each attribute using an instance-based learning approach. The weight reflects how well the feature distinguishes between class values. These weights define the feature rank, and features whose weights reach a specific threshold are selected to construct the final subset 36 .
Correlation. This algorithm uses the filter method to select features. It applies a heuristic that measures the effectiveness of individual features in predicting the class label, along with the level of inter-correlation between them 38 . Attributes that correlate weakly with the class should be avoided, as should redundant attributes, which correlate highly with one or more of the remaining attributes. The formula used to filter out the redundant and irrelevant attributes, which contribute to poor class prediction, is:
$$M_P = \frac{j\,\overline{r_{cf}}}{\sqrt{j + j(j-1)\,\overline{r_{ff}}}}$$
where $M_P$ represents the heuristic merit of a feature subset $P$ having $j$ attributes, $\overline{r_{cf}}$ is the mean attribute-class correlation, and $\overline{r_{ff}}$ is the average attribute-attribute inter-correlation.
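As a rough illustration of the correlation-based merit heuristic, the following Python sketch scores a feature subset from its mean attribute-class and mean attribute-attribute correlations. It assumes Pearson correlation with absolute values as the correlation measure, which the text does not specify; the names `pearson` and `cfs_merit` are hypothetical and are not taken from Weka.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def cfs_merit(features, labels):
    """Merit of a subset of j features: j * r_cf / sqrt(j + j * (j - 1) * r_ff).

    features: list of j feature columns; labels: numeric class labels.
    Absolute correlations are used (an assumption of this sketch).
    """
    j = len(features)
    # Mean attribute-class correlation
    r_cf = sum(abs(pearson(f, labels)) for f in features) / j
    # Mean attribute-attribute inter-correlation (0 for a single feature)
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return j * r_cf / math.sqrt(j + j * (j - 1) * r_ff)
```

In this sketch, adding a feature that is uncorrelated with the class lowers the merit, and adding an exact copy of an existing feature does not raise it, which matches the stated goal of discarding irrelevant and redundant attributes.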
OneR. ONR is one of the simplest classifiers in Weka. It is generally used for nominal data values. OneR produces a set of classification rules based on a single feature 39 and selects the feature with the lowest error rate as its "one rule" 40 . The error rate is determined by the number of instances that do not belong to the majority class of their feature value. The method provides a baseline for classification performance and can deliver more satisfactory results than many more refined approaches 30 .

Machine learning classifiers.
SVM. The SVM is a classifier that helps in multi-class classification problems. It draws a hyperplane that maximizes the separation margin between two classes and minimizes the error 41 . The model offers significant advantages, such as the absence of local minima, good generalization to new objects, and a representation that relies on few parameters 42 . Given a training set of input vectors $x_i$ with class labels $y_i \in \{-1, +1\}$, the soft-margin SVM solves
$$\min_{w,\,b,\,\zeta} \; \frac{1}{2}\lVert w \rVert^2 + C_b \sum_{i} \zeta_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \zeta_i, \;\; \zeta_i \ge 0,$$
where $\zeta_i$ penalises the objective function for data samples that cross the margin meant for their class, and $C_b$ is the box constraint.
Linear discriminant analysis. LDA is used for dimensionality reduction as a preprocessing step in machine learning and pattern classification applications. Its purpose is to project the data onto a lower-dimensional space with optimized class separability and reduced computational cost 43 .
Logistic regression. LR is a variation of the traditional regression approach, applied when the dependent variable is binary 44 . Like other regression models, it is a predictive analysis model that interprets data and explains the association between one dependent variable and one or more nominal or ordinal independent variables. In this approach, the modelled quantity is the probability that an event occurs; the response therefore takes a discrete number of values, while the predicted probability is restricted to the range between 0 and 1. It can be written as:
$$P(x) = \frac{1}{1 + e^{-f(x)}}$$
where $P(x)$ is the probability of a specific output event, $x = (x_1, x_2, \ldots, x_n)$ is the input vector of independent predictors, and $f(x)$ is the LR prototype.
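The mapping from f(x) to a probability can be sketched in a few lines of Python. This is a minimal illustration and not the Weka or MATLAB implementation used in the experiments: it assumes that f(x) is a linear combination of the predictors, and the `weights`, `bias`, and `threshold` parameters are hypothetical names introduced here.

```python
import math


def logistic_probability(x, weights, bias):
    """P(x) = 1 / (1 + exp(-f(x))), with f(x) = bias + sum(w_i * x_i) assumed linear."""
    f_x = bias + sum(w * xi for w, xi in zip(weights, x))
    # Use the algebraically equivalent form for negative f(x) to avoid overflow
    if f_x >= 0:
        return 1.0 / (1.0 + math.exp(-f_x))
    e = math.exp(f_x)
    return e / (1.0 + e)


def predict_label(x, weights, bias, threshold=0.5):
    """Assign class 1 (positive site) when the probability reaches the threshold."""
    return 1 if logistic_probability(x, weights, bias) >= threshold else 0
```

The output always lies between 0 and 1, and a threshold (0.5 in this sketch) converts it into the binary class label.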
Multi-layer perceptron. MLP is a type of feed-forward neural network. It comprises three kinds of layers: an input layer, an output layer, and one or more hidden layers, as illustrated in Fig. 2. The input layer receives the input data for processing. The output layer performs the task of classification and prediction. The hidden layers, which reside between the input and output layers, are the real computation engine of the MLP. An MLP is trained with backpropagation, a technique through which the weights of the network are optimized. The MLP can approximate any continuous function and solve tasks that are not linearly separable. It usually performs

Experiments
Data description. The experiments are conducted using the C. burnetii soil feature dataset, consisting of 94 specimens. Each specimen comprises 21 soil features. We need a supervised dataset to formulate a predictive model using classification techniques. So, the next step is to allocate suitable labels to every instance in the dataset. Thus, for +Ve and -Ve C. burnetii soil samples, class labels "1" and "0" were assigned, respectively.
Software tools. Weka is employed to train and test models on the C. burnetii dataset of soil features 45 . First, we saved the soil attribute dataset for C. burnetii in a CSV file and opened the file in Weka's GUI. Second, we ranked the soil features using various feature selection methods. Third, we selected a classification algorithm and calculated its accuracy by adding top-ranked attributes one by one from the list, using a nested subset approach. For some classifiers, MATLAB libraries are employed during experimentation.
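The nested subset approach can be sketched as follows. This is an illustrative Python outline and not the Weka workflow itself: `evaluate` stands for any routine that returns a classifier's accuracy on a given feature subset, and both function names are hypothetical.

```python
def nested_subset_curve(ranked_features, evaluate):
    """Score the top-1, top-2, ..., top-D prefixes of a ranked feature list.

    ranked_features: feature names ordered from most to least relevant.
    evaluate: callable taking a list of feature names and returning an accuracy.
    """
    curve = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]  # the k top-ranked attributes
        curve.append((k, evaluate(subset)))
    return curve


def best_prefix(curve):
    """Return (k, accuracy) with the highest accuracy; ties go to the smaller k."""
    return max(curve, key=lambda item: (item[1], -item[0]))
```

Preferring the smaller subset on ties is a design choice of this sketch; the text does not state how ties are resolved.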
Performance evaluation. The soil dataset is utilized to test and train the model using various machine learning classifiers by applying a 10-fold cross-validation approach. The approach randomly divides the dataset into ten subsets of the same size, where each part has nearly an identical class distribution. Each subset is applied one by one as a test dataset, while the remaining subsets of the split serve as a training set. At each step, the model's accuracy is calculated, and the results of all outcomes are averaged to generate the final accuracy.
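The procedure above can be sketched in plain Python. This is an illustration under stated assumptions and not Weka's implementation: the round-robin stratification, the `seed` parameter, and the `fit` and `predict` callables are hypothetical choices made for this example.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with near-identical class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so fold sizes stay balanced as well as class proportions.
        for pos, i in enumerate(idx):
            folds[(offset + pos) % k].append(i)
        offset += len(idx)
    return folds


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean test accuracy over k folds; every fold serves once as the test set.

    Assumes len(y) >= k so that no fold is empty.
    """
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # Train on the remaining k - 1 folds, then score on the held-out fold
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)
```

With 94 samples split 47/47, this sketch yields folds of 9 or 10 samples whose class counts differ by at most one.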

Results
The current section presents the experimental results of the feature-ranking models and compares their performance across different machine learning classifiers. Four algorithms are used for classification: SVM, LDA, LR, and MLP. 10-fold cross-validation is applied to assess the performance more reliably and to avoid overfitting. Firstly, the features of the C. burnetii dataset are ranked using three feature-ranking models. Table 3 illustrates the ranking produced by the feature-ranking algorithms CR, ONR, and RLF. The column "Attribute Index"
Table 3, the following conclusions can be drawn: Similarly, Table 3 shows that out of the 9 least-significant features, 6 features, i.e., {Pb, MO, P, Cu, Mn, cy}, are recurring among all ranking methods.
Secondly, we perform a two-phase feature ranking to determine the contribution of each attribute toward the persistence of C. burnetii in soil. Initially, attributes are ranked by the individual feature-ranking methods; then a combination of techniques is applied to calculate the weighted score that determines the final soil attribute rank. The top-ranked and least-ranked attributes are displayed separately in Tables 4 and 5. These tables show each feature-ranking method's scores and the final aggregate score of each soil attribute for the C. burnetii dataset. The aggregate score is the sum of the scores of all the attribute ranking methods. The lower the aggregate score, the higher the rank of the attribute, and vice versa.
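The aggregation step can be sketched in Python as follows. The ranks used here are the ones reported for K, Cr, and Mg in the top-ranked table; the function name `aggregate_ranking` is hypothetical, and the sketch sums the per-method ranks with equal weight, as the description of the aggregate score states.

```python
def aggregate_ranking(ranks_by_method):
    """Sum each attribute's per-method rank; lower totals mean higher final rank."""
    totals = {attr: sum(ranks.values()) for attr, ranks in ranks_by_method.items()}
    # sorted() is stable, so tied attributes keep their input order
    ordered = sorted(totals, key=lambda attr: totals[attr])
    return totals, ordered


ranks = {
    "K": {"RLF": 2, "ONR": 2, "CR": 4},
    "Cr": {"RLF": 1, "ONR": 4, "CR": 8},
    "Mg": {"RLF": 8, "ONR": 12, "CR": 9},
}
totals, ordered = aggregate_ranking(ranks)
print(totals)   # {'K': 8, 'Cr': 13, 'Mg': 29}
print(ordered)  # ['K', 'Cr', 'Mg']
```

How ties between equal aggregate scores are broken is not stated in the text; keeping the input order is an assumption of this sketch.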
The first row of Table 4 shows that {K} is ranked 2nd by RLF and ONR and 4th by CR; the last column shows its aggregate score of 8, the sum of the scores of all the attribute ranking methods (2 + 2 + 4 = 8). The second row shows that {Cr} is ranked 1, 4, and 8 by RLF, ONR, and CR, respectively, with an aggregate score of 13. Similarly, the last row shows that {Mg} is ranked 8, 12, and 9 by RLF, ONR, and CR, respectively, with an aggregate score of 29. {K} is therefore the top-ranked attribute, as its aggregate score (8) is the minimum, and {Cr} is the 2nd top-ranked attribute with an aggregate score of 13. Similarly, the results in the Table 5
Fig. 3 reflect that potassium (K) is the most significant attribute, with an aggregate score of 8. Similarly, the least-ranked features are shown in Fig. 4, which shows that Mn is the least significant attribute with a ranking score of 56, which is the sum of individual feature scores of 16, 21, and 20 for RLF, ONR, and CR, respectively.
Thirdly, we evaluated the performance of these feature-ranking methods with different machine learning classifiers. The results of the experiments are shown in Table 6. For every feature-ranking technique, the row "rk" gives the ranking sequence of attributes. The table then presents the results of the classifiers (MLP, LR, LDA, and SVM) according to the ranking sequence of each feature-ranking model. Across the ranking models and classification techniques, the accuracy ranges from 53.19% (SVM) to 82.98% (SVM). The most relevant feature for CR is {N}. Using this feature alone, SVM, LDA, and LR produce classification accuracies of 63.91%, 62.9%, and 62.89%, respectively. The most relevant feature for ONR is {Cd}, and for RLF it is {Cr}. Using Cd (ONR), LDA generates an accuracy of 59.6%; using Cr (RLF), SVM produces an accuracy of 57.45%.
We can infer various conclusions from the analysis of Table 6: (a) The three attribute-ranking models deliver distinct rankings, which generate different classification outcomes.
Figure 5 shows the accuracy of the machine learning models using CR as the attribute-ranking technique. Although the feature subsets are the same for all classifiers, LDA performs better than the other classifiers for the initial features, whereas SVM shows excellent results for mid-level features. All the classifiers display a considerable decrease in accuracy for the last set of features. SVM generates a classification accuracy of 82.98%, the highest of all models, so its overall performance is better than that of the other machine learning classifiers.
Figure 6 represents accuracy curves for the classification algorithms using the RLF feature-ranking technique. All the classifiers show a similar trend; SVM and MLP achieve classification accuracies of 81.91% and 80.85%, respectively, higher than any other classification method. For the initial set of features, LDA and MLP seem to perform better than the other classifiers, and for mid-level features they stand close to SVM. Nevertheless, the overall performance of SVM is better than that of the other classifiers.
Figure 7 illustrates the accuracy of the classification models using ONR as the attribute-ranking technique. All the classifiers show a similar trend across the nested subsets of soil features except MLP, which shows a sharp increase for mid-level features. Although LR and LDA show better results for the initial features, SVM outperforms the other classifiers for the last subset of features.
In summary, the results suggest that: (a) the 6 features that contribute most to the persistence of the pathogen in the environment are {K, Cr, Cd, N, OM, SS}; (b) the 5 least contributing features for Coxiella are {Mn, cy, P, Cu, Pb}; (c) feature ranking using RLF generates better results for all machine learning algorithms than other feature-ranking models.
Table 6. A comparison of results from various feature-ranking methods against different machine learning classifiers using the C. burnetii dataset.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
CR rk 8 4 3 21 18 1 17 12 10 13 20 2 19 7 5 11 6 16 15 9
Comparison with previous machine learning approaches. Although some researchers have applied machine learning to classify soil-borne pathogens like F. tularensis and the environmental factors that help their persistence in soil, more data are needed specifically for C. burnetii in the soil-related environment, as shown in Table 7. Furthermore, unlike previous works, our model applies a two-phase feature ranking to a novel C. burnetii dataset.

Discussions.
Machine learning models are used as a standard in different disciplines, for example, soil classification 46 , medical science 47 , bio-informatics 48 , and agriculture [49][50][51] . Our research shows that machine learning models, as an alternative to contemporary statistical models, give strong results for classifying C. burnetii (up to 82.98% accuracy) and for understanding the pathogen's behavior in the soil-related environment.
The results suggest that potassium, chromium, cadmium, nitrogen, organic matter, and soluble salts are the top 6 most significant features for the persistence of Coxiella, as exhibited in Table 4. Previous works also propose that abiotic characteristics such as pH, organic matter, and soil nutrients are not only the driving force for the soil bacterial community [52][53][54][55] but are also positively linked with the persistence of soil pathogens [56][57][58] . Some recent studies 9,22,23 also highlight the significance of the soil's physicochemical characteristics, such as organic matter, soluble salts, nitrogen, clay, potassium, cobalt, chromium, and cadmium, for the sustenance of B. anthracis, C. burnetii, and F. tularensis.
Our analysis further reveals that potassium is the most noteworthy feature for the presence of Coxiella in soil. Several works 9,22,23,59 report that the prevalence of pathogens such as B. anthracis, C. burnetii, and F. tularensis correlates positively with potassium in the soil. The next most important features that improve the likelihood of persistence of the pathogenic bacteria are chromium, cadmium, nitrogen, organic matter, and soluble salts. Recent studies 23,[56][57][58] indicate that the presence of organic matter and chromium favors the persistence of pathogens in the soil. Another study 60 reveals that nitrogen is essential for the sustenance of pathogens within their plant and animal hosts. Other studies 22,61 suggest that cadmium, nitrogen, soluble salts, and organic matter correlate positively with the prevalence of F. tularensis in soil. The prevalence of B. anthracis is also associated with the presence of organic matter, chromium, and potassium in soil 23 . Recent works 9,59 highlight that soluble salts are positively correlated with the presence of C. burnetii and F. tularensis. Similarly, one work 61 provides evidence that nitrogen and organic matter favor the persistence of C. burnetii, and another shows that nitrogen and organic matter are also positively related to the sustenance of the nitrogen-fixing bacterium A. brasilense 62 .
The remaining contributing features from Table 4 are sodium, pH, nickel, and magnesium. Previous studies [53][54][55] show that soil texture, pH, and nutrients are essential for bacterial communities. Our results agree with a recent study 23 that found magnesium, potassium, and sodium to be positively correlated with C. burnetii in soil-related environments. Another work 9 also shows a substantial difference between Coxiella negative and positive sites with respect to magnesium and sodium. A study 30 also reveals that F. tularensis has a positive affinity with soluble salts, nickel, and pH for its existence in soil. Another study 59 reveals that soluble salts and nickel contribute positively towards the presence of F. tularensis. Magnesium plays a substantial part in the persistence of microbes during starvation and cold shocks 63 . One work 25 illustrates that magnesium, sodium, potassium, and sulfate are conducive to F. tularensis growth in soil and water.
Our study also shows that silt, moisture, and cobalt fall in the middle of the ranking. Previous research reveals that silt holds substantial organic matter because of its larger surface area compared with the sandy fraction, which may increase the likelihood of pathogen prevalence 64 . Another study 23 shows that the persistence of C. burnetii is associated with a higher concentration of cobalt in the environment. A study 22 reveals that the persistence of F. tularensis is positively correlated with the presence of silt in soil. Another work 65 proposes that F. tularensis has a great affinity for moisture and low temperature.
Our machine learning analysis reveals that the seven least contributing features are manganese, clay, phosphorus, copper, lead, sand, and calcium, as shown in Table 5. A recent study 9 supports this finding by exhibiting no significant difference between Coxiella negative and positive sites regarding manganese, phosphorus, clay, lead, copper, and sand in the soil. A study 22 also reveals that manganese, phosphorus, calcium, copper, and sand do not show any positive affinity with F. tularensis in soil. Similar research 59 also reveals that clay, phosphorus, copper, lead, sand, and calcium are not positively correlated with F. tularensis. Some suggest 66 that during hot and dry weather, high manganese contents are seen in B. pseudomallei positive sites as opposed to negative sites. However, others 67 believe that the aerobic heterotrophic population of microbes is very susceptible to different minerals, such as cadmium, nickel, manganese, mercury, chromium, copper, and zinc. An analysis also reveals that manganese and zinc are essential for biological processes and exist as protein components in many species 67,68 . Some works propose that zinc helps in multiple cellular functions, such as pH regulation, metabolism, bacterial gene expression, DNA replication, glycolysis, and synthesis of amino acids, and acts as a cofactor of microbial virulence 69 . However, a surplus of zinc can cause toxicity; thus, these microbes possess a mechanism to maintain zinc equilibrium, carrying out crucial cellular functions while avoiding the damage that excess zinc may cause 70 .
Classification outcomes of C. burnetii in soil employing different machine learning techniques reveal that SVM surpasses all other machine learning models by generating an accuracy of 82.98% utilizing the initial 14 top-ranked features.

Conclusion
Soil texture and physical and chemical factors play an important role in the growth and survival of bacteria. Their relationship with C. burnetii is therefore investigated in this study. Recent machine learning models can help us better understand the association of microbes with various soil features. This research presents the classification and feature ranking of the pathogen using a soil feature dataset. Potassium is the top-ranked attribute, followed by chromium, cadmium, nitrogen, and organic matter, whereas manganese, clay, phosphorus, and copper are the least contributing features. RLF shows the best results for most of the classifiers. SVM produces the best accuracy of 82.98% for the initial 14 soil features {N, OM, SS, K, Na, pH, Cd, Cr, Mg, Ni, Ca, MO, Fe, Si}, using CR. With RLF, SVM and MLP generate accuracies of 81.91% and 80.85%, respectively. These machine learning models can also help us better understand the contribution of various soil features towards the survival of the pathogenic bacteria in the environment.

Future works
Various pathogens behave differently in the environment due to variations in their cell structure. Some of these pathogens are highly resistant to environmental factors and can survive in the environment for years. Understanding how these pathogens behave in different environmental conditions is crucial for the research community to predict future outbreaks. So machine learning models can significantly help in achieving this task. In our previous works, we tried to classify and learn how F. tularensis behaves in the environment. Our current work focuses on the classification of C. burnetii and how it behaves in the environment. In the future, we intend to expand this work for other pathogens to devise a comprehensive model that could help us in predicting various disease outbreaks by these pathogens.